
nsys perf and eval #2675

Merged
malay-nagda merged 3 commits into main from malay/nsys_perf_eval
Mar 10, 2026

Conversation

@malay-nagda
Collaborator

@malay-nagda malay-nagda commented Mar 6, 2026

What does this PR do ?

Evaluate perf between given boundaries.

Changelog

start = performance_config.get("eval_time_start_step")
    if start is None:
        start = max(1, int(len(steps) * performance_config.get("skip_first_percent_time", 0.1)))
    end = performance_config.get("eval_time_end_step")
    performance_result["metrics"]["current_avg_iter_time_ms"] = float(np.nanmean(current_iter_time_values[start:end]))
    performance_result["metrics"]["golden_avg_iter_time_ms"] = float(np.nanmean(golden_iter_time_values[start:end]))
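The slice in the snippet above follows standard Python semantics: start is inclusive, end is exclusive, and an end of None extends the window to the final step. A minimal sketch with made-up timings (the values and variable names are illustrative, not from the PR):

```python
import numpy as np

# Hypothetical per-step iteration times; the first step is warm-up and
# one step failed to record a timing (nan).
iter_time_ms = np.array([900.0, 520.0, 510.0, np.nan, 505.0, 515.0])

start, end = 1, None  # skip the warm-up step, keep everything after it
window = iter_time_ms[start:end]

# np.nanmean ignores nan entries, matching the averaging above.
avg = float(np.nanmean(window))
print(avg)  # 512.5
```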

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features
    • Added --eval_time_start_step and --eval_time_end_step configuration options for performance evaluation. These parameters let users specify an explicit evaluation window for timing averages, giving finer control over which steps are included in performance analysis.

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 6, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@malay-nagda malay-nagda marked this pull request as ready for review March 6, 2026 09:27
@malay-nagda malay-nagda requested review from a team and erhoo82 as code owners March 6, 2026 09:27
@malay-nagda malay-nagda requested a review from ko3n1g March 6, 2026 09:28
Contributor

@ko3n1g ko3n1g left a comment


Should we just refactor this to only use eval window instead of skip_n_steps? I'm wondering if there will be a case in future where we still need skip_n_steps?

@malay-nagda
Collaborator Author

Should we just refactor this to only use eval window instead of skip_n_steps? I'm wondering if there will be a case in future where we still need skip_n_steps?

Wouldn't it be useful in the case of a variable number of steps, like for convergence testing, or when the set number of steps was not completed due to a time limit or a crash? skip_n_percent can still calculate based on the available steps.

@malay-nagda malay-nagda requested a review from ko3n1g March 9, 2026 15:29
@coderabbitai
Contributor

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough


These changes introduce configurable timing window boundaries for performance evaluation. Two new CLI arguments (--eval_time_start_step and --eval_time_end_step) enable users to specify explicit step ranges for GPU utilization and iteration time averaging, overriding the previous percentage-based skipping logic when provided.

Changes

  • CLI Arguments (scripts/performance/argument_parser.py): Added --eval_time_start_step and --eval_time_end_step integer arguments to configure the timing evaluation window boundaries (0-indexed, start inclusive, end exclusive).
  • Configuration Propagation (scripts/performance/setup_experiment.py): Integrated the new timing window arguments into the performance configuration structure, passing them through to the performance_params dictionary when set.
  • Evaluation Logic (scripts/performance/utils/evaluate.py): Implemented windowing logic using the new start/end step indices for GPU utilization and iteration time averaging, replacing the previous percentage-based skip calculation and updating corresponding log messages to reflect explicit step ranges.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: 2 passed, 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

  • Test Results For Major Changes (⚠️ Warning): The PR modifies core performance evaluation logic affecting metric calculations (avg_iter_time_ms, gpu_util_values) but provides no documented test results or regression testing evidence. Resolution: Add test results documenting that the new eval_time_start_step/eval_time_end_step parameters work correctly, and provide regression test results confirming no behavioral changes to existing functionality.
  • Title check (❓ Inconclusive): The title "nsys perf and eval" is vague and uses abbreviated terms that don't clearly convey the specific change being made. Resolution: Use a more descriptive title that explains the main change, such as "Add configurable evaluation window boundaries for performance metrics" or "Support eval_time_start_step and eval_time_end_step in performance evaluation".

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
scripts/performance/utils/evaluate.py (2)

684-689: Duplicated window calculation logic.

The start/end calculation at lines 684-687 duplicates the logic from validate_performance (lines 328-331). Consider extracting a helper function to ensure consistent behavior and reduce maintenance burden.

♻️ Proposed helper function extraction

Add a helper function at module level:

def _get_eval_window(config: Dict[str, Any], num_steps: int) -> tuple[int, int | None]:
    """Compute (start, end) indices for the evaluation window.
    
    Args:
        config: Performance config dict with optional eval_time_start_step,
                eval_time_end_step, and skip_first_percent_time keys.
        num_steps: Total number of steps.
    
    Returns:
        Tuple of (start_index, end_index) where end_index may be None.
    """
    start = config.get("eval_time_start_step")
    if start is None:
        start = max(1, int(num_steps * config.get("skip_first_percent_time", 0.1)))
    end = config.get("eval_time_end_step")
    return start, end

Then use it in both locations:

-    start = config.get("eval_time_start_step")
-    if start is None:
-        start = max(1, int(len(steps) * config["skip_first_percent_time"]))
-    end = config.get("eval_time_end_step")
+    start, end = _get_eval_window(config, len(steps))
     current_stable = current_gpu_util_values[start:end]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/utils/evaluate.py` around lines 684 - 689, Extract the
duplicated start/end window logic into a module-level helper function (e.g.,
_get_eval_window(config: Dict[str, Any], num_steps: int) -> tuple[int,
Optional[int]]) that implements the same behavior as the duplicated blocks (use
config.get("eval_time_start_step"), fallback to max(1, int(num_steps *
config.get("skip_first_percent_time", 0.1))), and return eval_time_end_step as
end or None); then replace the duplicated logic in evaluate.py (the block
computing start/end before computing current_avg_iter_time_ms and
golden_avg_iter_time_ms) and the logic inside validate_performance with calls to
_get_eval_window(performance_config, len(steps)) to ensure consistent behavior.

328-333: Consider validating start and end bounds.

The current implementation doesn't validate that:

  1. start and end are non-negative (negative indices have different Python slice semantics)
  2. start < end (empty slice would cause np.nanmean to return nan)

While unlikely in practice, invalid CLI inputs could produce confusing behavior.

🛡️ Optional: Add bounds validation
     start = config.get("eval_time_start_step")
     if start is None:
         start = max(1, int(len(steps) * config["skip_first_percent_time"]))
+    elif start < 0:
+        raise ValueError(f"eval_time_start_step must be non-negative, got {start}")
     end = config.get("eval_time_end_step")
+    if end is not None and end < 0:
+        raise ValueError(f"eval_time_end_step must be non-negative, got {end}")
+    if end is not None and start >= end:
+        raise ValueError(f"eval_time_start_step ({start}) must be less than eval_time_end_step ({end})")
     current_stable = current_gpu_util_values[start:end]
     golden_stable = golden_gpu_util_values[start:end]
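The empty-slice failure mode described in point 2 is easy to reproduce: when start >= end the slice is empty, and np.nanmean returns nan (with a RuntimeWarning) instead of raising, so an invalid window would surface only as a nan metric downstream. A minimal illustration with made-up values:

```python
import warnings
import numpy as np

values = np.array([1.0, 2.0, 3.0])
start, end = 2, 1  # invalid window: start >= end

with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)  # "Mean of empty slice"
    result = float(np.nanmean(values[start:end]))  # empty slice -> nan

print(np.isnan(result))  # True
```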
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/utils/evaluate.py` around lines 328 - 333, Validate and
clamp the computed start and end before slicing current_gpu_util_values and
golden_gpu_util_values: ensure start and end are integers >= 0, clamp them to
the valid range (0 .. len(steps)), and check start < end (raise a ValueError or
return a clear error) so you don't produce negative-index slices or empty
ranges; update the logic around config, start, end, steps,
current_gpu_util_values and golden_gpu_util_values to perform these checks and
fail fast with a clear message if inputs are invalid.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e1756638-dd38-4019-891b-ec4db2365970

📥 Commits

Reviewing files that changed from the base of the PR and between d740eee and 9458a00.

📒 Files selected for processing (3)
  • scripts/performance/argument_parser.py
  • scripts/performance/setup_experiment.py
  • scripts/performance/utils/evaluate.py

@malay-nagda malay-nagda merged commit f532fcf into main Mar 10, 2026
24 of 26 checks passed
@malay-nagda malay-nagda deleted the malay/nsys_perf_eval branch March 10, 2026 09:45


3 participants